The datasets used in the project include the 2016 Green Taxi Trip Data (.csv), which is available from https://data.cityofnewyork.us/Transportation/2016-Green-Taxi-Trip-Data/hvrh-b6nb.
The main topic of this report is the exploration of mobility patterns in NYC taxi rides using spatial clustering. The motivation, methodology and modeling results are discussed in turn in the rest of the report.
Spatial data mining can provide useful insights for urban mobility analysis. In this third part, the core motivation is to detect the hot-spot regions of taxi rides and to explore travel mobility patterns in New York City, based on the NYC open dataset of taxi trips. Specifically, K-Means spatial clustering is used to partition the geographical coordinates of the pick-up and drop-off locations of taxi trips into clusters that act as regions, formalizing the locational similarity of these spatial objects.
The dataset 2016_Green_Taxi_Trip_Data.csv continues to be used in this part. The analysis mainly involves the following trip features: pick-up and drop-off locations, trip distance and passenger count. The taxi trips cover the period from 01/01/2016 00:00:00 to 01/02/2016 00:00:00 (dd/mm/yyyy), that is, the month of January 2016, which spans a little over four weeks. For the analysis of morning-rush-hour travel patterns in this report, trips made during the morning rush hours (6-10 am) on weekdays in this period are extracted for further analysis.
To apply K-Means to spatial clustering, the number of clusters and the initial cluster centers must be defined in advance. Unfortunately, it is difficult to choose reasonable initial conditions for a dataset as large as the one used here. However, from our prior knowledge, similar trips are more likely to be generated in the same neighborhood because of the community effect. This kind of effect can also be seen in many other aspects, such as housing prices, income levels, car ownership and transport mode choices. In this work, we collected the NYC neighborhood data, which are publicly available from https://www1.nyc.gov/site/planning/data-maps/open-data/dwn-nynta.page, and extracted the centroids of the neighborhood regions to serve as the initial cluster centers. The neighborhood distribution is shown in Figure 3.1.
Figure 3.1 NYC neighborhood Distribution
The steps of the K-Means algorithm are as follows:
Select K initial samples as the initial cluster centers \(a = a_1, a_2, \ldots, a_K\);
For each sample \(x_i\) in the dataset, compute the distance from \(x_i\) to each of the K cluster centers, and assign \(x_i\) to the cluster \(c_j\) whose center \(a_j\) is nearest to \(x_i\);
For each cluster \(c_j\), recalculate its center \(a_j=\frac{1}{\left| c_j \right|}\sum_{x\in c_j}x\) (that is, the centroid of all samples belonging to the cluster);
Repeat the assignment and update steps until a stopping condition (maximum number of iterations, minimum change in error, etc.) is reached.
K-Means as described here relies on the Euclidean distance metric, which suits the spatial clustering in this work. The initial condition for the K-Means clustering is given by the geographical coordinates of the NYC neighborhood centroids.
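The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used for the report: the function name `kmeans`, its parameters, the policy for empty clusters and the toy data are all assumptions made for the example. The key point it shows is that the initial centers are supplied by the caller, as the neighborhood centroids are here, instead of being drawn at random.

```python
from math import dist  # Euclidean distance (Python 3.8+)


def kmeans(points, init_centers, max_iter=100, tol=1e-9):
    """K-Means started from caller-supplied centers (hypothetical helper).

    points and init_centers are lists of (x, y) tuples. Returns the final
    centers and, for each point, the index of its assigned cluster.
    """
    centers = [tuple(c) for c in init_centers]
    labels = []
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest center.
        labels = [
            min(range(len(centers)), key=lambda j: dist(p, centers[j]))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                n = len(members)
                new_centers.append(
                    (sum(x for x, _ in members) / n,
                     sum(y for _, y in members) / n)
                )
            else:
                # Assumed policy: an empty cluster keeps its previous center.
                new_centers.append(old)
        # Stop once no center moves by more than tol.
        shift = max(dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:
            break
    return centers, labels


# Toy data (made up): two well-separated groups and two initial centers.
toy_points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
              (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
toy_init = [(1.0, 1.0), (9.0, 9.0)]
toy_centers, toy_labels = kmeans(toy_points, toy_init)
```

With one initial center per neighborhood, K equals the number of neighborhoods and each final cluster can be traced back to the neighborhood that seeded it.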
To remove noise from the raw data, several cleaning transformations are applied:
Trips must lie within the spatial coverage of NYC. (The coordinates of pick-ups and drop-offs should fall within x_lim = c(-74.25, -73.75) and y_lim = c(40.5, 40.95).)
The passenger count of each trip should be no more than four (Passenger_count < 5).
The trip duration should be more than 0 seconds. (The pick-up and drop-off times, given in local time format, are converted to a standard time format; the trip duration is then computed and checked.)
The hour and weekday of the pick-up and drop-off times of each trip are extracted for further exploration of travel mobility during rush hours.
Each taxi trip is split into a pick-up spatial point and a drop-off spatial point for further manipulation.
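The cleaning rules above can be expressed as a single filter. The Python sketch below is illustrative only: the dictionary keys (`pickup_lon`, `pickup_time` and so on), the timestamp layout in `TIME_FORMAT` and the toy trips are assumptions, not the column names or formats of the actual CSV. The bounding box and the passenger limit are the values quoted in the rules.

```python
from datetime import datetime

# Bounding box and passenger limit quoted in the cleaning rules.
X_LIM = (-74.25, -73.75)  # longitude range
Y_LIM = (40.5, 40.95)     # latitude range
MAX_PASSENGERS = 4
# Assumed timestamp layout; adjust to match the real data.
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def in_nyc(lon, lat):
    """True when a coordinate lies inside the bounding box."""
    return X_LIM[0] <= lon <= X_LIM[1] and Y_LIM[0] <= lat <= Y_LIM[1]


def clean_trip(trip):
    """Return an enriched copy of trip if it passes every rule, else None."""
    if not (in_nyc(trip["pickup_lon"], trip["pickup_lat"])
            and in_nyc(trip["dropoff_lon"], trip["dropoff_lat"])):
        return None
    if trip["passenger_count"] > MAX_PASSENGERS:
        return None
    pickup = datetime.strptime(trip["pickup_time"], TIME_FORMAT)
    dropoff = datetime.strptime(trip["dropoff_time"], TIME_FORMAT)
    duration = (dropoff - pickup).total_seconds()
    if duration <= 0:
        return None
    return {
        **trip,
        "duration_s": duration,
        "pickup_hour": pickup.hour,
        "pickup_weekday": pickup.weekday(),  # Monday is 0, Sunday is 6
    }


def make_trip(plon, plat, dlon, dlat, passengers, start, end):
    """Build a toy trip record with the hypothetical keys used above."""
    return {"pickup_lon": plon, "pickup_lat": plat,
            "dropoff_lon": dlon, "dropoff_lat": dlat,
            "passenger_count": passengers,
            "pickup_time": start, "dropoff_time": end}


# Toy trips (made up): only the first one satisfies every rule.
toy_trips = [
    make_trip(-73.95, 40.80, -73.98, 40.75, 1,
              "2016-01-04 08:15:00", "2016-01-04 08:30:00"),
    make_trip(-75.00, 40.80, -73.98, 40.75, 1,   # pick-up outside the box
              "2016-01-04 08:15:00", "2016-01-04 08:30:00"),
    make_trip(-73.95, 40.80, -73.98, 40.75, 6,   # too many passengers
              "2016-01-04 08:15:00", "2016-01-04 08:30:00"),
    make_trip(-73.95, 40.80, -73.98, 40.75, 1,   # zero duration
              "2016-01-04 08:15:00", "2016-01-04 08:15:00"),
]
cleaned = [c for c in map(clean_trip, toy_trips) if c is not None]

# Split the surviving trips into pick-up points and drop-off points.
pickup_points = [(c["pickup_lon"], c["pickup_lat"]) for c in cleaned]
dropoff_points = [(c["dropoff_lon"], c["dropoff_lat"]) for c in cleaned]
```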
The NYC neighborhood data are represented as polygons in the shapefile format. The coordinates of the centroid of each neighborhood region are extracted using spatial processing and organized in spatial point format. The map of neighborhood centroid distribution is shown in Figure 3.2.
Figure 3.2 Map of NYC neighborhood centroids
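For readers unfamiliar with how a polygon centroid is obtained, the Python sketch below shows the standard area-weighted (shoelace) formula for a single simple ring. It is an illustration under stated assumptions, not the spatial processing used for the report: it treats coordinates as planar, ignores holes and multi-part polygons, and the function name `polygon_centroid` is hypothetical. A spatial library would normally handle these cases.

```python
def polygon_centroid(ring):
    """Area-weighted centroid of a simple polygon given as (x, y) vertices.

    Works whether or not the first vertex is repeated at the end, because
    a zero-length closing edge contributes nothing to the sums.
    """
    area2 = 0.0  # twice the signed area
    cx = cy = 0.0
    n = len(ring)
    for i in range(n):
        x0, y0 = ring[i]
        x1, y1 = ring[(i + 1) % n]
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    if area2 == 0:
        raise ValueError("degenerate polygon with zero area")
    return cx / (3.0 * area2), cy / (3.0 * area2)


# Toy polygon (made up): a 2 x 2 square whose centroid is its middle.
square = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
square_centroid = polygon_centroid(square)
```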
Before the spatial clustering, the spatial distributions of the pick-up and drop-off locations are checked; they are shown in Figure 3.3 and Figure 3.4.
Because plotting Figures 3.3 to 3.5 is time-consuming when generating the HTML, we use images output by the scripts instead of plotting the 1.2 million points every time we knit the R Markdown file.
Figure 3.3 Distribution of pick-up locations of taxi rides
Figure 3.4 Distribution of drop-off locations of taxi rides
From the distribution map of pick-up locations, it can be observed that there are no pick-ups within most of Manhattan. This is due to the operating restrictions on green taxis, which are not permitted to accept street hails in the Hail Exclusionary Zone (Manhattan south of West 110th St and East 96th St). Given this imbalance between pick-ups and drop-offs in Manhattan, we treat pick-up and drop-off locations separately and use only the drop-off data to explore spatial variation within Manhattan.
In urban areas, functional regions such as business districts, hospitals and large residential communities are often hot-spot regions with high travel demand for taxi rides. Identifying and characterizing the urban hot-spot regions of taxi rides may help to discover interesting and potentially useful patterns for urban planning. To achieve this, spatial clustering is conducted on the pick-up and drop-off locations, with the centroids of NYC neighborhoods set as the initial cluster centers. The locational similarity of taxi trips is visualized in Figure 3.5.
Figure 3.5 Distribution of taxi-ride clusters
It can be seen from Figure 3.5 that the clustering produces a partition somewhat similar to the division of New York City into neighborhoods. Where the taxi-trip clusters split areas differently, they may reflect more reasonable delimitations of neighborhood boundaries.
However, it can still be difficult to identify the hot-spot regions of taxi rides from Figure 3.5, because the spatial points of most clusters are densely distributed. To visualize the number of taxi trips within each cluster, we represent each cluster by the coordinates of its center on the map and draw a circle to show its relative size, so that hot-spot regions appear as blue circles with a larger radius. The radius of each circle only represents the relative relationship between clusters. Some circles may overlap with others, but the clusters they represent do not overlap. We use the leaflet package and an OpenStreetMap base map to visualize the cluster centers and relative sizes, which may help to relate the clusters to land-use types in New York City. The result is shown in Figure 3.6.
Figure 3.6 Centroids and relative sizes of taxi-trip clusters
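One way to turn cluster sizes into circle radii is sketched below in Python. The report only states that the radius reflects relative cluster size; the square-root scaling (so that circle area, not radius, is proportional to the trip count), the `max_radius` value and the function name `relative_radii` are assumptions made for this example.

```python
from collections import Counter
from math import sqrt


def relative_radii(labels, max_radius=30.0):
    """Map each cluster label to a radius relative to the largest cluster.

    Assumed scaling: radius grows with the square root of the trip count,
    so the circle area is proportional to the number of trips.
    """
    sizes = Counter(labels)
    largest = max(sizes.values())
    return {c: max_radius * sqrt(n / largest) for c, n in sizes.items()}


# Toy cluster labels (made up): clusters of 4, 1 and 16 trips.
toy_cluster_labels = [0] * 4 + [1] * 1 + [2] * 16
toy_radii = relative_radii(toy_cluster_labels)
```

Each radius can then be passed to the circle-drawing call of the mapping library, together with the corresponding cluster center.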
As we can see, four regions are typical hot-spot regions of taxi rides, with larger clusters: the area to the upper east of Central Park, the upper west of Brooklyn, Astoria and Jackson Heights. People in these areas may have a greater need for taxi rides, and these regions may have higher levels of travel mobility.
However, taxi rides in Manhattan cannot be represented correctly because of the restriction on pick-ups in this area. Hence, we use the drop-off locations separately to compare the mobility level of Manhattan with that of other NYC regions. The result is shown in Figure 3.7.
Figure 3.7 Centroids and relative sizes of drop-off clusters
From Figure 3.7, we can see that the drop-off clusters within Manhattan are still relatively small, which is unexpectedly low considering its high population density. This may be explained by its island geography and dense subway network, which lead people to take public transport in order to avoid heavy traffic jams.
To further compare local mobility patterns across regions, the pick-up and drop-off locations of taxi rides during the morning rush hours (6 am to 10 am) on weekdays are extracted and clustered into spatial groups. The cluster centers and relative sizes are shown in Figure 3.8. The green circles and violet circles represent the pick-up clusters and drop-off clusters respectively.
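The rush-hour extraction can be written as a simple predicate on the pick-up time. The Python sketch below is an illustration: the function name `is_morning_rush` and the toy timestamps are hypothetical, and treating the window as 6:00 inclusive to 10:00 exclusive is an assumption, since the text only gives the range 6 am to 10 am.

```python
from datetime import datetime


def is_morning_rush(pickup, start_hour=6, end_hour=10):
    """True for weekday pick-ups with start_hour <= hour < end_hour."""
    is_weekday = pickup.weekday() < 5  # Monday is 0, Friday is 4
    return is_weekday and start_hour <= pickup.hour < end_hour


# Toy pick-up times (made up) from January 2016.
toy_pickups = [
    datetime(2016, 1, 4, 8, 15),  # Monday, inside the window
    datetime(2016, 1, 6, 6, 0),   # Wednesday, start of the window
    datetime(2016, 1, 5, 10, 0),  # Tuesday, just after the window
    datetime(2016, 1, 9, 8, 15),  # Saturday, excluded as a weekend day
]
rush_pickups = [t for t in toy_pickups if is_morning_rush(t)]
```

The filtered trips can then be split into pick-up and drop-off points and clustered in the same way as the full dataset.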